Libraries

library(tidyverse)
library(lubridate)
library(stringr)
library(ggplot2)
library(ggridges)
library(htmltools)
library(bslib)
library(plotly)
library(dplyr)
library(grDevices)
library(usmap)
library(kableExtra)

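library(showtext)     # render custom fonts in plots
showtext_auto()       # use showtext for all graphics devices
font_add_google("Playfair Display", "playfair")   # register the Google font under the family name "playfair"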
shopping <- read_csv("proper_raw_shopping_behavior.csv")
shopping

Summary

This project explores shopping behavior using a dataset from Kaggle, containing data from 3,900 U.S. shoppers. The analysis focused on spending patterns, demographic trends, and seasonal or regional differences in purchasing habits. The goal was to identify which groups shop the most, which categories draw the highest spending, and how patterns shift or differ based on location and time of year.

After exploratory data analysis was performed, clear trends appeared such as mid-range spending, consistent seasonal fluctuations, and noticeable imbalances in gender representation. The results offer a broad picture of who shops, what they prefer, and where the most activity occurs, creating a foundation for understanding customer behavior in a simplified retail environment.

Purpose

The goal of this project was to explore how demographic and seasonal factors influence shopping behavior. The main questions to be answered were: Who spends the most? Which products and colors are most popular? How do location and season affect purchase trends? This analysis helps show how businesses might better understand consumer preferences and plan marketing strategies.

Data

  1. Dataset Description & Features

The dataset came from Kaggle and contains 3,900 U.S.-based customer records with 18 total attributes describing shopping patterns, preferences, and demographics. This project focused on the eight variables listed below to highlight specific spending and behavioral trends rather than analyzing every available feature.

The variables used were:

variables <- data.frame(
  Variable = c("Gender", "Age", "Category", "Color", "Purchase_Amount_(USD)",
               "Payment_Method", "Location", "Season"),
  Type = c("Categorical", "Numeric", "Categorical", "Categorical", "Numeric",
           "Categorical", "Categorical", "Categorical"),
  Description_Unit = c(
    "Male or Female",
    "Age of customer in years",
    "Product category (ex: Clothing, Footwear, etc.)",
    "Product color purchased",
    "Purchase amount in U.S. dollars",
    "Method used for payment (Debit/Credit Card, PayPal, etc.)",
    "Customer’s U.S. state",
    "Season of purchase (Spring, Summer, Fall, Winter)"
  )
)

kable(variables, col.names = c("Variable", "Type", "Description / Unit")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE, color = "#fff", background = "#cc0099")
| Variable | Type | Description / Unit |
|---|---|---|
| Gender | Categorical | Male or Female |
| Age | Numeric | Age of customer in years |
| Category | Categorical | Product category (ex: Clothing, Footwear, etc.) |
| Color | Categorical | Product color purchased |
| Purchase_Amount_(USD) | Numeric | Purchase amount in U.S. dollars |
| Payment_Method | Categorical | Method used for payment (Debit/Credit Card, PayPal, etc.) |
| Location | Categorical | Customer’s U.S. state |
| Season | Categorical | Season of purchase (Spring, Summer, Fall, Winter) |

These variables provided a good balance between numeric and categorical data that helped explore spending behavior, preferences, and demographic patterns.

  2. Missing & Null Values

There were no missing or null values in the selected columns. The data was already clean, so there was no need to make corrections or fill in missing values before analysis. This made it easier to jump straight into exploring patterns and building visuals.

  3. Methodology and Changes

The dataset was already in a tidy format: each row represents a single purchase record and each column a single attribute. The tidyverse packages, dplyr in particular, were used for summarizing, filtering, and visualizing data. Functions like count(), mutate(), and summarize() were used to create comparisons between gender, categories, and locations.

When visualizing, the reorder() function was used with ggplot2 to organize bars and boxplots in a more readable way. Other than that, the dataset didn’t require any major structural changes, so the main focus was on exploring relationships and identifying trends in spending and shopping behavior.

Exploratory Data Analysis

a) Summary Stats & Overview

summary(shopping)
  Customer ID          Age           Gender          Item_Purchased       Category        
 Min.   :   1.0   Min.   :18.00   Length:3900        Length:3900        Length:3900       
 1st Qu.: 975.8   1st Qu.:31.00   Class :character   Class :character   Class :character  
 Median :1950.5   Median :44.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1950.5   Mean   :44.07                                                           
 3rd Qu.:2925.2   3rd Qu.:57.00                                                           
 Max.   :3900.0   Max.   :70.00                                                           
 Purchase_Amount_(USD)   Location             Size              Color          
 Min.   : 20.00        Length:3900        Length:3900        Length:3900       
 1st Qu.: 39.00        Class :character   Class :character   Class :character  
 Median : 60.00        Mode  :character   Mode  :character   Mode  :character  
 Mean   : 59.76                                                                
 3rd Qu.: 81.00                                                                
 Max.   :100.00                                                                
    Season          Review_Rating  Subscription_Status Shipping_Type     
 Length:3900        Min.   :2.50   Length:3900         Length:3900       
 Class :character   1st Qu.:3.10   Class :character    Class :character  
 Mode  :character   Median :3.70   Mode  :character    Mode  :character  
                    Mean   :3.75                                         
                    3rd Qu.:4.40                                         
                    Max.   :5.00                                         
 Discount_Applied   Promo_Code_Used    Previous_Purchases Payment_Method    
 Length:3900        Length:3900        Min.   : 1.00      Length:3900       
 Class :character   Class :character   1st Qu.:13.00      Class :character  
 Mode  :character   Mode  :character   Median :25.00      Mode  :character  
                                       Mean   :25.35                        
                                       3rd Qu.:38.00                        
                                       Max.   :50.00                        
 Frequency of Purchases
 Length:3900           
 Class :character      
 Mode  :character      
                       
                       
                       
  • The summary statistics give a quick view of the dataset’s structure. Customer ages range from 18 to 70, with a median of 44 and a mean of 44.07, so the age distribution is centered on middle-aged shoppers. Purchase amounts range from $20 to $100, with a median of $60 and a mean of about $59.76; the middle half of purchases falls between $39 and $81, so the center of spending sits in the mid-range. Previous purchases range widely (1 to 50), suggesting that customers vary a lot in how often they shop. The dataset mixes categorical and numerical variables. Overall, the summary shows a dataset that is complete (no missing values are reported) and structured in a way that supports a straightforward analysis.

Histogram for Age distribution 👶 👴

ggplot(shopping, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "maroon3", color = "thistle1") + 
  labs(
    title = "Distribution of Customer Ages", 
    x = "Age", 
    y = "Count"
  ) +  
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

  • This histogram shows the distribution of customer ages in the dataset. There are roughly two peaks, the first around ages 20 to 30 and the second between 50 and 60. This suggests that the main groups of shoppers are either young or old, with fewer middle-aged shoppers in this sample.

Histogram for purchase amount distribution 🛒💲

ggplot(shopping, aes(x = `Purchase_Amount_(USD)` )) + 
  geom_histogram(binwidth = 5,  fill = "maroon3", color = "thistle1") + 
  labs(
    title = "Distribution of Purchase Amounts", 
    x = "Purchase Amount in Dollars", 
    y = "Number of Purchases"
  ) +  
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

  • This histogram shows the distribution of purchase amounts in the shopping data. There are two or three peaks, including one around the 30–35 dollar range and another near 90–95 dollars. This means that customers tend to spend on either the lower or the higher end, with fewer purchases falling in the mid-range.

Box Plot for Purchase Amount by Product Category

shopping %>%
  ggplot(aes(x = reorder(Category, `Purchase_Amount_(USD)`, median), 
             y = `Purchase_Amount_(USD)`, fill = Category)) +
  geom_boxplot(show.legend = FALSE, alpha = 0.6) +
  coord_flip() +
  labs(
    title = "Purchase Amount by Product Category",
    subtitle = "Outerwear and Clothing show slightly higher spending ranges.",
    x = "Product Category",
    y = "Purchase Amount in Dollars"
  ) + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

  • This boxplot shows how spending looks across the main product categories and helps analyze which areas customers spend their money on the most. Footwear, Clothing, and Accessories all have similar median purchase amounts, but their spreads are different, which means some categories have more variation in what people are willing to spend. Outerwear has the widest spread, suggesting customers either buy very low-cost basics or pricier pieces depending on the style or season. This visual helps support the overall goal of the project by showing which categories have the most consistent spending and where customers tend to shop across a wider price range.

Average Spending by Gender

shopping %>%
  group_by(Gender) %>%
  summarize(
    avg_spending = mean(`Purchase_Amount_(USD)`, na.rm = TRUE),
    total_spending = sum(`Purchase_Amount_(USD)`, na.rm = TRUE)
  )

shopping %>%
  group_by(Gender) %>%
  summarize(avg_spending = mean(`Purchase_Amount_(USD)`, na.rm = TRUE)) %>%
  ggplot(aes(x = Gender, y = avg_spending, fill = Gender)) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Average Spending by Gender",
    subtitle = "Shows the mean purchase amount across all transactions.",
    x = "Gender",
    y = "Average Spending (USD)"
  ) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.position = "none"
  )

  • Male customers account for a higher spending total, but that is mostly because the dataset is heavily skewed toward men. When comparing averages instead of totals, female shoppers actually spent slightly more per purchase. This contrast shows why totals and averages should be read together: totals reflect sample size, while averages reflect behavior. So even though men appear to spend more overall, women spend just as much (or even a bit more) on a per-purchase basis.

Total Spending by Gender

shopping %>%
  group_by(Gender) %>%
  summarize(total_spending = sum(`Purchase_Amount_(USD)`, na.rm = TRUE)) %>%
  ggplot(aes(x = Gender, y = total_spending, fill = Gender)) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Total Spending by Gender",
    subtitle = "Shows the combined purchase amount across all customers.",
    x = "Gender",
    y = "Total Spending (USD)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.position = "none"
  )

  • This chart compares the total amount spent by male and female customers. Male shoppers clearly lead in total spending, but that’s mostly because there are more of them in the dataset. Even though men made more purchases overall, it doesn’t necessarily mean they spend more per purchase, just that their larger numbers add up to a higher total.

c) Product & Color Preferences

Product Category by Gender

shopping %>%
  count(Gender, Category) %>%
  ggplot(aes(x = Category, y = n, fill = Gender)) +
  geom_col(position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(title = "Product Categories by Gender",
       x = "Category", y = "Number of Purchases") + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

  • This chart compares product category purchases by gender. The sample does not appear to reflect the broader population. Clothing and Accessories are the most popular categories for both groups, but men make more purchases in every category. The gap is noticeable everywhere, and especially in Clothing and Footwear. This reflects the dataset’s imbalance: because it contains more male entries, the results describe shopping behavior within this sample rather than in the broader population.

Top 5 Colors by Gender

color_by_gender <- shopping %>%
  count(Gender, Color, name = "Occurrences") %>%
  group_by(Gender) %>%
  slice_max(order_by = Occurrences, n = 5)

ggplot(color_by_gender, aes(x = reorder(Color, Occurrences),
                            y = Occurrences, fill = Gender)) +
  geom_col(position = "identity", alpha = 0.7) +  # "identity" overlays the bars at their own heights instead of the default stacking, which distorted the bar sizes
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Top 5 Colors by Gender",
    subtitle = "Transparent overlap highlights shared color preferences.",
    x = "Color",
    y = "Number of Purchases"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.text = element_text(family = "playfair", size = 10),
    legend.title = element_text(family = "playfair", face = "bold", size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

  • This chart compares the top color choices of male and female shoppers. Pink, magenta, and green are most popular among women, while men lean toward cooler tones like teal, cyan, and silver. More overlap between the groups might be expected, but because this dataset has more male entries overall and arrived already cleaned, the color rankings are likely skewed. Even with that imbalance, yellow and olive still show up as shared favorites across both groups, which might reflect more neutral or universally appealing tones.

Top 5 Colors by Product Category

top_colors_by_cat <- shopping %>%
  count(Category, Color, name = "Occurrences") %>%
  group_by(Category) %>%
  slice_max(order_by = Occurrences, n = 5)

top_colors_by_cat %>%
  ggplot(aes(x = reorder(Color, Occurrences), 
             y = Occurrences, 
             fill = Category)) +
  geom_col(show.legend = FALSE, color = "thistle1", alpha = 0.8) +
  coord_flip() +
  facet_wrap(~ Category, scales = "free_y") +
  labs(
    title = "Top 5 Colors by Product Category",
    subtitle = "Most frequently purchased colors within each category.",
    x = "Color",
    y = "Number of Occurrences"
  ) +
  scale_fill_manual(values = rep("#cc0099", length(unique(top_colors_by_cat$Category)))) + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

  • This chart shows the five most frequently purchased colors within each product category. Accessories lean toward neutral and earthy tones like Olive and Gray, while Clothing includes deeper shades like Teal and Maroon. Footwear trends toward softer colors, and Outerwear favors cooler ones such as Blue and Violet.

Results

Conclusions

References

---
title: "Deliverable 5"
author: "｡⋆˚⭑⟡✴︎ Kelby MK Palmer ✴︎⟡⭑˚⋆｡"
date: "November 9, 2025"
output: 
  html_notebook: 
    theme: 
      version: 4
      bg: "#ffe6ff"
      fg: "#cc0099"
      primary: "#6b2b1a"
      base_font: 
        google: "Source Sans Pro"
      heading_font: 
        google: "Playfair Display"
---


# Libraries
```{r}
library(tidyverse)
library(lubridate)
library(stringr)
library(ggplot2)
library(ggridges)
library(htmltools)
library(bslib)
library(plotly)
library(dplyr)
library(grDevices)
library(usmap)
library(kableExtra)

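library(showtext)     # render custom fonts in plots
showtext_auto()       # use showtext for all graphics devices
font_add_google("Playfair Display", "playfair")   # register the Google font under the family name "playfair"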


```

```{r}
shopping <- read_csv("proper_raw_shopping_behavior.csv")
shopping
```


# Summary
This project explores shopping behavior using a dataset from Kaggle, containing data from 3,900 U.S. shoppers. The analysis focused on spending patterns, demographic trends, and seasonal or regional differences in purchasing habits. The goal was to identify which groups shop the most, which categories draw the highest spending, and how patterns shift or differ based on location and time of year.

After exploratory data analysis was performed, clear trends appeared such as mid-range spending, consistent seasonal fluctuations, and noticeable imbalances in gender representation. The results offer a broad picture of who shops, what they prefer, and where the most activity occurs, creating a foundation for understanding customer behavior in a simplified retail environment.


# Purpose
The goal of this project was to explore how demographic and seasonal factors influence shopping behavior. The main questions to be answered were: Who spends the most? Which products and colors are most popular? How do location and season affect purchase trends? This analysis helps show how businesses might better understand consumer preferences and plan marketing strategies.

# Data

a. Dataset Description & Features

The dataset came from Kaggle and contains 3,900 U.S.-based customer records with 18 total attributes describing shopping patterns, preferences, and demographics. This project focused on the eight variables listed below to highlight specific spending and behavioral trends rather than analyzing every available feature.

The variables used were:
```{r}
variables <- data.frame(
  Variable = c("Gender", "Age", "Category", "Color", "Purchase_Amount_(USD)",
               "Payment_Method", "Location", "Season"),
  Type = c("Categorical", "Numeric", "Categorical", "Categorical", "Numeric",
           "Categorical", "Categorical", "Categorical"),
  Description_Unit = c(
    "Male or Female",
    "Age of customer in years",
    "Product category (ex: Clothing, Footwear, etc.)",
    "Product color purchased",
    "Purchase amount in U.S. dollars",
    "Method used for payment (Debit/Credit Card, PayPal, etc.)",
    "Customer’s U.S. state",
    "Season of purchase (Spring, Summer, Fall, Winter)"
  )
)

kable(variables, col.names = c("Variable", "Type", "Description / Unit")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover")) %>%
  row_spec(0, bold = TRUE, color = "#fff", background = "#cc0099")
```


These variables provided a good balance between numeric and categorical data that helped explore spending behavior, preferences, and demographic patterns.

b. Missing & Null Values

There were no missing or null values in the selected columns. The data was already clean, so there was no need to make corrections or fill in missing values before analysis. This made it easier to jump straight into exploring patterns and building visuals.
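
A completeness check of this kind can be sketched as follows. The `toy` data frame and its values are made up for illustration and are not rows from the real dataset; on the actual data the same calls would be applied to `shopping`.

```{r}
# Hypothetical toy data standing in for `shopping` (values are made up).
toy <- data.frame(
  Age    = c(25, 41, NA, 63),
  Gender = c("Male", "Female", "Male", NA),
  Season = c("Fall", "Winter", "Spring", "Summer")
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
na_counts

# TRUE only when no value anywhere in the data frame is missing.
all_complete <- !anyNA(toy)
all_complete
```

If every entry of `na_counts` is zero, no imputation or row filtering is needed before plotting.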

c. Methodology and Changes

The dataset was already in a tidy format: each row represents a single purchase record and each column a single attribute. The tidyverse packages, dplyr in particular, were used for summarizing, filtering, and visualizing data. Functions like count(), mutate(), and summarize() were used to create comparisons between gender, categories, and locations.

When visualizing, the reorder() function was used with ggplot2 to organize bars and boxplots in a more readable way. Other than that, the dataset didn’t require any major structural changes, so the main focus was on exploring relationships and identifying trends in spending and shopping behavior.
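
As a rough sketch of how these functions fit together, the chunk below runs them on a small made-up data frame. The `toy` object, its `Amount` column, and all values are hypothetical and only illustrate the pattern.

```{r}
library(dplyr)

# Hypothetical toy purchases (made-up values, not rows from the real dataset).
toy <- data.frame(
  Category = c("Clothing", "Clothing", "Clothing", "Footwear", "Footwear", "Accessories"),
  Amount   = c(80, 90, 70, 40, 50, 60)
)

# count() tallies rows per group; mutate() turns the tallies into shares.
category_share <- toy %>%
  count(Category) %>%
  mutate(pct = n / sum(n))

# summarize() collapses each group to one row of statistics.
category_median <- toy %>%
  group_by(Category) %>%
  summarize(median_amount = median(Amount))

# reorder() sorts the factor levels by a statistic (here the median amount),
# which is what controls the order of bars and boxes in a plot.
ordered_levels <- levels(reorder(toy$Category, toy$Amount, FUN = median))
ordered_levels
```

Without `reorder()`, the categories would be drawn in alphabetical order rather than by median amount.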

# **Exploratory Data Analysis**

## a) Summary Stats & Overview

```{r}
summary(shopping)
```
- The summary statistics give a quick view of the dataset's structure. Customer ages range from 18 to 70, with a median of 44 and a mean of 44.07, so the age distribution is centered on middle-aged shoppers. Purchase amounts range from $20 to $100, with a median of $60 and a mean of about $59.76; the middle half of purchases falls between $39 and $81, so the center of spending sits in the mid-range. Previous purchases range widely (1 to 50), suggesting that customers vary a lot in how often they shop. The dataset mixes categorical and numerical variables. Overall, the summary shows a dataset that is complete (no missing values are reported) and structured in a way that supports a straightforward analysis.
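
The quartiles reported by `summary()` can be used to check how much of the data actually sits in the middle of the range. The sketch below uses a made-up `amounts` vector, not the real purchase amounts, to show the calculation.

```{r}
# Hypothetical purchase amounts in USD (made-up values for illustration).
amounts <- c(20, 25, 35, 40, 55, 60, 65, 80, 90, 95, 100)

# Five-number summary plus the mean, as printed by summary().
summary(amounts)

# First and third quartiles: the bounds of the middle half of the values.
quartiles <- quantile(amounts, probs = c(0.25, 0.75))
iqr_width <- unname(quartiles[2] - quartiles[1])

# Share of purchases that fall inside the interquartile range.
share_mid <- mean(amounts >= quartiles[1] & amounts <= quartiles[2])
share_mid
```

A median near the midpoint of the range does not by itself show that spending clusters in the middle, which is why the quartiles and the histogram are worth reading alongside it.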


### **Histogram** for _Age distribution_ 👶 👴 
```{r}
ggplot(shopping, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "maroon3", color = "thistle1") + 
  labs(
    title = "Distribution of Customer Ages", 
    x = "Age", 
    y = "Count"
  ) +  
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This histogram shows the distribution of customer ages in the dataset. There are roughly two peaks, the first around ages 20 to 30 and the second between 50 and 60. This suggests that the main groups of shoppers are either young or old, with fewer middle-aged shoppers in this sample.

### **Histogram** for _purchase amount distribution_ 🛒💲
```{r}
ggplot(shopping, aes(x = `Purchase_Amount_(USD)` )) + 
  geom_histogram(binwidth = 5,  fill = "maroon3", color = "thistle1") + 
  labs(
    title = "Distribution of Purchase Amounts", 
    x = "Purchase Amount in Dollars", 
    y = "Number of Purchases"
  ) +  
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This histogram shows the distribution of purchase amounts in the shopping data. There are two or three peaks, including one around the 30–35 dollar range and another near 90–95 dollars. This means that customers tend to spend on either the lower or the higher end, with fewer purchases falling in the mid-range.
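
Peaks read off a histogram can be cross-checked by counting values per bin. This is a minimal sketch on a made-up `amounts` vector, assuming 5-dollar bins that start at 20; ggplot2 may align its bins differently, so the counts are only a rough counterpart to the plot.

```{r}
# Hypothetical purchase amounts (made-up values with two clusters).
amounts <- c(22, 31, 32, 33, 34, 48, 55, 62, 91, 92, 93, 94, 99)

# Bin into 5-dollar intervals, mirroring binwidth = 5 in the histogram.
bins <- cut(amounts, breaks = seq(20, 100, by = 5),
            include.lowest = TRUE, right = FALSE)
bin_counts <- table(bins)

# The two fullest bins mark the candidate peaks.
top_bins <- names(sort(bin_counts, decreasing = TRUE))[1:2]
top_bins
```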

### **Box Plot** for _Purchase Amount by Product Category_ 

```{r}
shopping %>%
  ggplot(aes(x = reorder(Category, `Purchase_Amount_(USD)`, median), 
             y = `Purchase_Amount_(USD)`, fill = Category)) +
  geom_boxplot(show.legend = FALSE, alpha = 0.6) +
  coord_flip() +
  labs(
    title = "Purchase Amount by Product Category",
    subtitle = "Outerwear and Clothing show slightly higher spending ranges.",
    x = "Product Category",
    y = "Purchase Amount in Dollars"
  ) + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This boxplot shows how spending looks across the main product categories and helps analyze which areas customers spend their money on the most. Footwear, Clothing, and Accessories all have similar median purchase amounts, but their spreads are different, which means some categories have more variation in what people are willing to spend. Outerwear has the widest spread, suggesting customers either buy very low-cost basics or pricier pieces depending on the style or season. This visual helps support the overall goal of the project by showing which categories have the most consistent spending and where customers tend to shop across a wider price range.


## b) Shopping Behavior & Spending Trends

### Payment Methods Distribution   💳 💵 📱
```{r}
shopping %>%
  count(Payment_Method) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = reorder(Payment_Method, pct), y = pct, fill = Payment_Method)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)), vjust = -0.3) +
  labs(
    title = "Distribution of Payment Methods",
    subtitle = "Majority of purchases were made via Credit Card and PayPal.",
    x = "Payment Method",
    y = "Share of Total Transactions"
  ) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This bar chart shows that Credit Card and PayPal are the most common payment methods. Bank Transfer is used the least, likely due to its slower processing and the lack of rewards, as compared to credit cards. Bank transfers also tend to involve more steps and security measures than digital payments, so it makes sense that shoppers prefer faster, one-click options like PayPal and credit cards.

### Average Spending by Gender
```{r}
shopping %>%
  group_by(Gender) %>%
  summarize(
    avg_spending = mean(`Purchase_Amount_(USD)`, na.rm = TRUE),
    total_spending = sum(`Purchase_Amount_(USD)`, na.rm = TRUE)
  )

shopping %>%
  group_by(Gender) %>%
  summarize(avg_spending = mean(`Purchase_Amount_(USD)`, na.rm = TRUE)) %>%
  ggplot(aes(x = Gender, y = avg_spending, fill = Gender)) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Average Spending by Gender",
    subtitle = "Shows the mean purchase amount across all transactions.",
    x = "Gender",
    y = "Average Spending (USD)"
  ) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.position = "none"
  )

```
- Male customers account for a higher spending total, but that is mostly because the dataset is heavily skewed toward men. When comparing averages instead of totals, female shoppers actually spent slightly more per purchase. This contrast shows why totals and averages should be read together: totals reflect sample size, while averages reflect behavior. So even though men appear to spend more overall, women spend just as much (or even a bit more) on a per-purchase basis.
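
The relationship between totals and averages can be made concrete with a small example. The `toy` data frame below is hypothetical (made-up amounts, with twice as many male rows) and only illustrates that a group's total equals its row count times its average.

```{r}
# Hypothetical purchases (made-up values): four from men, two from women.
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Male", "Female", "Female"),
  Amount = c(50, 60, 55, 55, 62, 60)
)

# Split the amounts by gender, then compute count, total, and mean per group.
groups <- split(toy$Amount, toy$Gender)
by_gender <- data.frame(
  Gender = names(groups),
  n      = sapply(groups, length),
  total  = sapply(groups, sum),
  avg    = sapply(groups, mean),
  row.names = NULL
)
by_gender
```

In this toy case the male total is larger only because there are more male rows, while the female average is higher.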


### Total Spending by Gender
```{r}
shopping %>%
  group_by(Gender) %>%
  summarize(total_spending = sum(`Purchase_Amount_(USD)`, na.rm = TRUE)) %>%
  ggplot(aes(x = Gender, y = total_spending, fill = Gender)) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Total Spending by Gender",
    subtitle = "Shows the combined purchase amount across all customers.",
    x = "Gender",
    y = "Total Spending (USD)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.position = "none"
  )
```
- This chart compares the total amount spent by male and female customers. Male shoppers clearly lead in total spending, but that’s mostly because there are more of them in the dataset. Even though men made more purchases overall, it doesn’t necessarily mean they spend more per purchase, just that their larger numbers add up to a higher total.


## c) Product & Color Preferences

### Product Category by Gender
```{r}
shopping %>%
  count(Gender, Category) %>%
  ggplot(aes(x = Category, y = n, fill = Gender)) +
  geom_col(position = "dodge", alpha = 0.7) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(title = "Product Categories by Gender",
       x = "Category", y = "Number of Purchases") + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This chart compares product category purchases by gender. It is clear that **the sample does not reflect the population.** Clothing and accessories are the most popular across both groups, but males make more purchases in every category. The gap is noticeable in all categories, but especially clothing and footwear. This reflects the dataset’s imbalance, since it contains more male entries, so the results describe shopping behavior within this sample rather than in the broader population.
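- One possible way to look past the imbalance is to compare each category's share of purchases within a gender instead of raw counts. The chunk below is only a sketch on made-up counts (`toy_counts` is hypothetical and was not part of the original analysis).

```{r}
library(dplyr)

# Hypothetical counts, not the real data:
# men outnumber women 2:1 in every category.
toy_counts <- data.frame(
  Gender   = c("Male", "Male", "Female", "Female"),
  Category = c("Clothing", "Footwear", "Clothing", "Footwear"),
  n        = c(60, 40, 30, 20)
)

# Convert counts to the fraction of each gender's own purchases.
toy_shares <- toy_counts %>%
  group_by(Gender) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

toy_shares
```

- Here the raw counts differ by a factor of two, yet both groups put the same share of their purchases into each category, which is the kind of pattern raw counts alone can hide.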


### Top 5 Colors by Gender
```{r}
color_by_gender <- shopping %>%
  count(Gender, Color, name = "Occurrences") %>%
  group_by(Gender) %>%
  slice_max(order_by = Occurrences, n = 5)

ggplot(color_by_gender, aes(x = reorder(Color, Occurrences),
                            y = Occurrences, fill = Gender)) +
  # position = "identity" draws both genders' bars from zero so they overlap;
  # the default ("stack") would pile them on top of each other and distort the bar heights.
  geom_col(position = "identity", alpha = 0.7) +
  scale_fill_manual(values = c("Female" = "#cc0099", "Male" = "lightblue")) +
  labs(
    title = "Top 5 Colors by Gender",
    subtitle = "Transparent overlap highlights shared color preferences.",
    x = "Color",
    y = "Number of Purchases"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    legend.text = element_text(family = "playfair", size = 10),
    legend.title = element_text(family = "playfair", face = "bold", size = 11),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )
```
- This chart compares the top color choices between male and female shoppers. Pink, magenta, and green are most popular among women, while men lean toward cooler tones like teal, cyan, and silver. You’d usually expect more overlap between the groups, but this dataset has more male entries overall and arrived already cleaned, which likely skews the color rankings. Even with that imbalance, yellow and olive still show up as shared favorites across both groups, which might represent more neutral or universally appealing tones.
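- A detail worth knowing about `slice_max()`: by default it keeps ties, so a "top 5" can return more than five rows when counts are equal. The chunk below is a small sketch on made-up counts (`toy_colors` is hypothetical); whether ties actually occur in the shopping data was not checked here.

```{r}
library(dplyr)

# Hypothetical color counts with a tie for second place.
toy_colors <- data.frame(
  Color       = c("Red", "Blue", "Green", "Gray"),
  Occurrences = c(10, 8, 8, 5)
)

# Default behavior (with_ties = TRUE) keeps every row tied at the cutoff.
top2_with_ties <- toy_colors %>%
  slice_max(order_by = Occurrences, n = 2)

# with_ties = FALSE returns exactly n rows.
top2_no_ties <- toy_colors %>%
  slice_max(order_by = Occurrences, n = 2, with_ties = FALSE)

nrow(top2_with_ties)
nrow(top2_no_ties)
```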

### Top 5 Colors by Product Category
```{r}
top_colors_by_cat <- shopping %>%
  count(Category, Color, name = "Occurrences") %>%
  group_by(Category) %>%
  slice_max(order_by = Occurrences, n = 5)

top_colors_by_cat %>%
  ggplot(aes(x = reorder(Color, Occurrences), 
             y = Occurrences, 
             fill = Category)) +
  geom_col(show.legend = FALSE, color = "thistle1", alpha = 0.8) +
  coord_flip() +
  facet_wrap(~ Category, scales = "free_y") +
  labs(
    title = "Top 5 Colors by Product Category",
    subtitle = "Most frequently purchased colors within each category.",
    x = "Color",
    y = "Number of Occurrences"
  ) +
  scale_fill_manual(values = rep("#cc0099", length(unique(top_colors_by_cat$Category)))) + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

```
- This chart shows the five most frequently purchased colors within each product category. Accessories lean toward neutral and earthy tones like Olive and Gray, while Clothing includes deeper shades like Teal and Maroon. Footwear trends toward softer colors, and Outerwear favors cooler ones such as Blue and Violet.

## d) Regional & Seasonal Trends

### Top 10 Customer Locations 📍 🗺️
```{r}
shopping %>%
  count(Location, name = "Purchases") %>%
  arrange(desc(Purchases)) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(Location, Purchases), y = Purchases)) +
  geom_col(fill = "#cc0099") +
  coord_flip() +
  labs(title = "Top 10 Customer Locations by Purchase Volume",
       x = "Location", y = "Number of Purchases") + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )
```
- This bar chart shows where most customers are shopping from. Montana and California have the most total purchases, with states like Idaho, Illinois, and Alabama close behind. The distribution looks fairly balanced overall. The modest variation between states could come from anything from differences in population to differences in marketing strategy in each location.

### U.S. Map (Purchase Volume by State)
```{r}
# to load the font
showtext_auto()
font_add_google("Playfair Display", "playfair")

# sum the state-level purchases
state_purchases <- shopping %>%
  count(Location, name = "Purchases") %>%
  rename(state = Location) %>%
  mutate(code = state.abb[match(state, state.name)])  # get state abbreviations

# make interactive map
us_shopping_map <- plot_geo(state_purchases, locationmode = "USA-states") %>%
  add_trace(
    locations = ~code,
    z = ~Purchases,
    text = ~paste0(
      "<b>", state, "</b><br>",
      Purchases, " purchases 🛍️✨💖" #this should be shown on hover
    ),
    hoverinfo = "text",
    colorscale = list(c(0, 1), c("#f8d6e0", "#cc0099")),
    marker = list(line = list(color = "white", width = 1))
  ) %>%
  colorbar(title = "Purchases") %>%
  layout(
    title = list(
      text = "<b>Customer Purchase Volume by State</b><br><span style='font-size:12px;'>Hover to explore regional activity</span>",
      font = list(family = "Playfair Display", size = 20)
    ),
    geo = list(
      scope = "usa",
      projection = list(type = "albers usa"),
      bgcolor = "lightblue",          # background
      lakecolor = "#e9f2f9",
      showlakes = TRUE
    ),
    font = list(family = "Playfair Display")
  )

us_shopping_map
```
- This map was added as an exploratory visual to get a broad sense of how shopping activity varies across the country. The darker pink shades show higher purchase volume, which makes states like Montana, California, and Illinois stand out a bit more. Even though the map is general and only shows state-level totals, it still gets the point across: most states have pretty steady, moderate shopping levels, with just a few that lean higher. It was mainly included to explore overall patterns, not to support exact or predictive geographic conclusions.
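- The map chunk looks up abbreviations with `state.abb[match(state, state.name)]`, which returns `NA` for any name that is not one of the 50 built-in state names. A quick check like the sketch below can confirm nothing is silently left without a code. The `toy_locations` vector is hypothetical, and "Washington DC" is a deliberate non-match.

```{r}
# Hypothetical location values; the last one is not in state.name.
toy_locations <- c("Montana", "California", "Washington DC")

# Same lookup as in the map chunk, using base R's built-in state vectors.
toy_codes <- state.abb[match(toy_locations, state.name)]
toy_codes

# Names that found no abbreviation and so would have nothing to plot.
toy_unmatched <- toy_locations[is.na(toy_codes)]
toy_unmatched
```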

### Total Spending by Season ☀️ ❄️ 🌷 🍂

```{r}
shopping %>%
  group_by(Season) %>%
  summarise(total_spending = sum(`Purchase_Amount_(USD)`, na.rm = TRUE)) %>%
  ggplot(aes(x = Season, y = total_spending, fill = Season)) +
  geom_col() +
  scale_fill_manual(values = c("Fall" = "coral", "Spring" = "maroon2", "Summer" = "gold", "Winter" = "lightblue")) +
  labs(title = "Total Spending by Season", x = "Season", y = "Total Spending (in Dollars)") + 
  theme(
    text = element_text(family = "playfair"),
    plot.title = element_text(family = "playfair", face = "bold", size = 16),
    plot.subtitle = element_text(family = "playfair", size = 11),
    axis.title = element_text(family = "playfair", size = 12),
    axis.text = element_text(family = "playfair", size = 10),
    panel.grid.minor = element_blank()
  )

```
- This chart shows how total spending changes across the seasons. Fall has the highest spending overall, while Summer has the lowest. Spring and Winter are in the middle and are fairly close to each other, leaning more toward the higher levels seen in Fall. Altogether, it gives a simple look at when shoppers tend to spend the most and least throughout the year.
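- Because `Season` is text, ggplot2 orders the bars alphabetically (Fall, Spring, Summer, Winter). If calendar order reads better, one option is to convert the column to a factor with explicit levels. The chunk below is a sketch on a stand-in vector (`toy_season` and `season_order` are hypothetical names, and starting the year at Winter is just one choice).

```{r}
# Hypothetical season labels standing in for the Season column.
toy_season <- c("Fall", "Spring", "Summer", "Winter")

# Default: factor levels are sorted alphabetically.
levels(factor(toy_season))

# Explicit calendar order; the starting season is a choice, not a rule.
season_order <- c("Winter", "Spring", "Summer", "Fall")
toy_season_ordered <- factor(toy_season, levels = season_order)
levels(toy_season_ordered)
```

- If the levels are reordered, naming the colors in `scale_fill_manual()` (for example `"Fall" = "coral"`) keeps each color attached to the intended season.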


### Top 5 Colors by Season

```{r}
top_colors_by_season <- shopping %>%
  count(Season, Color, name = "Occurrences") %>%
  group_by(Season) %>%
  slice_max(order_by = Occurrences, n = 5)

ggplot(top_colors_by_season, aes(x = Color, y = Occurrences, fill = Season)) +
  geom_col(position = "dodge") +
  facet_wrap(~Season, scales = "free_x") +
  scale_fill_manual(values = c(
    "Fall" = "coral",
    "Spring" = "maroon2",
    "Summer" = "gold",
    "Winter" = "lightblue"
  )) +
  labs(
    title = "Top 5 Colors in Each Season",
    x = "Color",
    y = "Occurrences"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    text = element_text(family = "playfair"),                 # apply font
    plot.title = element_text(family = "playfair", face = "bold", size = 14),
    plot.subtitle = element_text(family = "playfair"),
    axis.title = element_text(family = "playfair"),
    axis.text = element_text(family = "playfair", size = 9),
    axis.text.x = element_text(angle = 45, hjust = 1),        # rotate only the color labels
    strip.text = element_text(family = "playfair", face = "bold")
  )

```
- This chart shows the top five colors in each season and gives a quick feel for how shopper preferences shift throughout the year. Fall tends to lean warm with shades like Magenta, Olive, and Orange. Spring switches into brighter colors such as Pink and Teal. Summer has more cool tones like Silver, Teal, and Green, while Winter brings in deeper colors like Green, Peach, and Maroon. Overall, the color trends line up with the general mood of each season.


# Results
- For this project, the focus was on exploring the data visually to figure out how people shop and what patterns stand out. Descriptive statistics and graphs in ggplot2 were used to break things down by gender, category, season, location, etc. Since the dataset was already clean, there wasn't much need for data wrangling; the work was mostly grouping, summarizing, and visualizing what was already there.

- This approach was best because it lets the data speak for itself. The boxplots, bar charts, and map made it easy to find differences between groups and find trends that might not be obvious when looking at raw numbers. The color and shape differences in each chart also help show group trends. The goal was to understand who’s shopping, what they’re buying, and when and where it’s happening, and visuals helped make that story clear. 

- Overall, it was found that most people spend in a steady mid-range, with digital payments like credit cards and PayPal being the most common. Men made more total purchases, but that’s mostly because there are more male shoppers in the dataset. On average, women actually spend slightly more per purchase. Colors like pink, teal, and yellow stand out across both groups, and seasons like spring and fall see the most shopping activity. States like Montana, California, and Illinois came out on top for total purchase volume.



# Conclusions

- This project helped point out how all these shopping habits connect, like how gender, location, and season can influence what people buy and when. One limitation is that the dataset leans more male, which affects totals and preferences, so that’s something to consider when looking at the results. It also doesn’t include details like customer income or store type, which would’ve helped to understand why people spend the way they do. Another limitation is that while data can be analyzed by season, it would be even more helpful if the data included the exact months. That way, specific holidays or sale periods would be shown and could be analyzed to see how more specific timing impacts shopping behavior.

- Even with those limits, the trends are really useful. The results suggest that convenience-based payment methods dominate, mid-range spending is typical, and purchasing patterns are fairly consistent across seasons and geography. This may help businesses identify which products and payment methods appeal most to online shoppers.

- If this project were to be continued, next steps could include digging into how age affects spending or whether certain product categories are more popular with different age groups. Another step would be comparing this data to real-world sales info or state populations to see how accurate the trends are.
- Overall, this project came together well and shows a clear story about the consumers in this dataset, what they're buying, and when and how they're doing it.
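- As a rough sketch of the age-based next step mentioned above, ages could be binned with `cut()` and then summarized per group. Everything in the chunk below is hypothetical: the `toy_ages` data, the `amount` column, and the age bands themselves are illustrative choices, not results from the dataset.

```{r}
library(dplyr)

# Hypothetical ages and purchase amounts (NOT from the shopping dataset).
toy_ages <- data.frame(
  Age    = c(19, 27, 34, 45, 58, 66),
  amount = c(40, 55, 60, 65, 50, 45)
)

# Bin ages into illustrative bands, then average spending within each band.
# cut() uses right-closed intervals by default: (17,29], (29,44], (44,59], (59,Inf].
toy_age_summary <- toy_ages %>%
  mutate(age_group = cut(
    Age,
    breaks = c(17, 29, 44, 59, Inf),
    labels = c("18-29", "30-44", "45-59", "60+")
  )) %>%
  group_by(age_group) %>%
  summarize(n = n(), avg_spending = mean(amount))

toy_age_summary
```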



# References
- Data Visualization with ggplot2: Cheat Sheet. (n.d.). Retrieved November 10, 2025, from https://lile.duke.edu/wp-content/uploads/2020/07/R_ggplot2_cheatsheet.pdf
- Google Fonts. (n.d.). Google Fonts. https://fonts.google.com/specimen/Playfair+Display
- Hadley Wickham. (2019). ggplot2: Elegant Graphics for Data Analysis. Ggplot2-Book.org. https://ggplot2-book.org/
- Lorenzo, P. D. (2025). US Maps Including Alaska and Hawaii [R package usmap version 1.0.0]. R-Project.org. https://cran.r-project.org/package=usmap
- Package “ggplot2”: Create Elegant Data Visualisations Using the Grammar of Graphics. (2020). https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
- R for Data Science (2e). (n.d.). R4ds.hadley.nz. https://r4ds.hadley.nz/
- Zubaira Maimona. (2025). Shopping behaviours dataset. Kaggle.com. https://www.kaggle.com/datasets/zubairamuti/shopping-behaviours-dataset/data